class: center, middle, inverse, title-slide .title[ # ADIM: Bayesian Spatial Statistics ] .subtitle[ ## Master’s Degree in Data Analysis, Process Improvement and Decision Support Engineering ] .author[ ###
Joaquín Martínez-Minaya, 2024-12-16
VAlencia BAyesian Research Group
Statistical Modeling Ecology Group
Grupo de Ingeniería Estadística Multivariante
] .date[ ###
jmarmin@eio.upv.es
]

---

# John Snow's Cholera Map of London, 1854

.left-column3[
<div style="margin-top:-30px;"> </div>
]

.right-column3[
]

---

# Outline

## 1. Spatial statistics. Types of spatial data

## 2. Disease mapping

## 3. Geostatistics

## 4. Penalized complexity priors (PC-priors)

## 5. References

---
class: inverse, center, middle, animated, rotateInDownRight

# 1. Spatial statistics. Types of spatial data

---

# Spatial statistics. Types of spatial data

.left-column6[
.hlb[Spatial statistics] is the branch of statistics that deals with spatial data and studies spatial patterns.

- .hlb[Lattice or areal data]: observations are taken at a finite number of sites that together make up the entire study region (discrete space), e.g. the number of sick people by province.

- .hlb[Point pattern]: the interest lies in studying the process that generates the points, e.g. the distribution of trees on a mountain.

- .hlb[Geostatistical data]: observations collected at a fixed set of locations over a continuous spatial field, e.g. the amount of fish in the ocean or the presence/absence of a plant in a country.
]

.right-column6[
<div style="margin-top:-30px;"> </div>
]

---
class: inverse, center, middle, animated, rotateInDownRight

# 2. Disease mapping

---

# Oral cancer mortality in the Valencian Region

- In this analysis, we study .hlb[oral cancer mortality in the municipalities of the Valencian Region] using a disease mapping model. The aim is to understand the spatial distribution of risk and to identify high-risk areas while accounting for variability due to population size and random noise. The variables are:

.left-column4[
- .hlb[Obs]: the number of observed deaths from oral cancer in the study period.

- .hlb[Exp]: the number of expected deaths, based on population size and age-specific rates.

- .hlb[SMR]: the standardized mortality ratio, calculated as `\(\text{SMR} = \frac{\text{Obs}}{\text{Exp}}\cdot 100\)`

  - **SMR = 100**: risk is equivalent to that of the standard population.
  - **SMR > 100**: **excess risk**.

  - **SMR < 100**: **reduced risk**.
]

.right-column4[
.center[  ]
]

---

# The model

- A conditionally independent .hlb[Poisson] likelihood function is assumed:
`$$y_i \sim \text{Poisson}(\lambda_i), \ \ \lambda_i =E_i \rho_i, \ \ \log(\rho_i)=\eta_i \,\,, i=1, \ldots, 32 \,\,$$`
where `\(y_i\)` and `\(E_i\)` are the observed and expected deaths in area `\(i\)`, and `\(\rho_i\)` is its relative risk.

--

- We assume that `\(\eta_i=\beta_0 + u_i + v_i\)`, where `\(\boldsymbol{u}\)` is the .hlb[independent random effect] and `\(\boldsymbol{v}\)` is the .hlb[spatially structured random effect]:
`$$u_i \sim \mathcal{N}\left(0, \tau_{\boldsymbol{u}}^{-1}\right), \ v_i \mid \boldsymbol{v}_{-i} \sim \mathcal{N} \left(\frac{1}{n_{i}} \sum_{i \sim j} v_j, \frac{1}{n_{i} \tau_{\boldsymbol{v}}}\right)\,\,.$$`
Here `\(i \sim j\)` indicates that areas `\(i\)` and `\(j\)` are neighbours, and `\(n_i\)` is the number of neighbours of area `\(i\)`. In this case `\(\boldsymbol{\theta}=(v_1, \ldots, v_{32}, u_1, \ldots, u_{32})\)`, and `\(\boldsymbol{\theta} \mid \boldsymbol{\psi}\)` is Gaussian distributed.

--

- The standard deviations `\(\sigma_u = \tau_{\boldsymbol{u}}^{-1/2}\)` and `\(\sigma_v = \tau_{\boldsymbol{v}}^{-1/2}\)` are assigned uniform .hlb[hyperpriors]:
`$$\sigma_u, \sigma_v \sim \text{Uniform}(0, \infty)$$`

---

# Predicting Risk in the Valencian Region

.center[  ]

---
class: inverse, center, middle, animated, rotateInDownRight

# 3. Geostatistics

---

# Continuous spaces

.left-column5[
- Sometimes, the assumption that the observations have been collected at .hlbred[discrete time] points has to be relaxed.

- The same happens in .hlb[space].

- If we are studying the .hlb[presence of a disease], .hlb[pollution in an area] or the temperature of a country, the locations where the phenomenon of interest is measured are often not arranged on a lattice.

- Then, we are dealing with .hlb[continuous spaces] in 1D and 2D.
]

.right-column5[
.center[ .hlb[**Malaria Prevalence** in Mozambique] ]
]

---

# Malaria Prevalence in Mozambique

- This analysis studies .hlb[malaria prevalence in Mozambique] using a spatial Bayesian model. The goal is to predict malaria risk and evaluate the effects of environmental and demographic covariates.
.left-column4[
- .hlb[Examined]: the number of individuals examined for malaria.

- .hlb[Positive]: the number of individuals testing positive for malaria.

- .hlb[Covariates]:

  - **Altitude**: elevation of the study location (in meters).

  - **Temperature**: average temperature (in °C).

  - **Proximity to water bodies**: distance to the nearest water source (in kilometers).
]

.right-column4[
.center[  ]
]

---

# Geostatistics. Basics

</br>

- #### Geostatistical models assume that the observations are spatially correlated.

--

</br>

- #### They are based on the following principle (Tobler's first law of geography):

</br>

## .center[.hlb[Everything is related to everything else, but near things are more related than distant things]]

--

- #### So, two nearby locations tend to .hlb[co-vary] more than two locations that are far apart.

---

# Let's be a bit more formal

- A random spatial effect `\(w(s)\)` at a location `\(s \in \mathcal{D}\)` can be considered as a .hlb[stochastic process] indexed by a location `\(s\)` that varies continuously over the domain `\(\mathcal{D}\)`, where `\(\mathcal{D}\)` is a fixed subset of `\(r\)`-dimensional Euclidean space.

--

- The spatial process `\(w(s)\)` is Gaussian if, for any `\(n \geq 1\)` and any set of sites `\(s = \{s_1, \ldots, s_n\}\)`, the vector `\(w = \{w(s_1 ), \ldots, w(s_n)\}\)` has a multivariate normal distribution with mean `\(\mu = E(w(s))\)` and a structured covariance matrix `\(\Sigma\)`. Usually `\(\mu\)` is assumed to be `\(\boldsymbol{0}\)`. In the literature, this process is widely known as a .hlb[Gaussian field (GF)].

--

- The key ingredient in spatial statistics is the covariance function `\(\mathcal{C}\)`, which determines the covariance between the random variables at two different points. If `\(s_i\)` and `\(s_j\)` are two locations in space, then the .hlb[covariance function] is defined as
`$$\mathcal{C}(w(s_i), w(s_j)) = Cov(w(s_i), w(s_j))$$`

- It defines the covariance matrix `\(\boldsymbol{\Sigma}\)` of the GF.
Each element `\(\boldsymbol{\Sigma}_{ij}\)` of the matrix is defined as:
`$$\boldsymbol{\Sigma}_{ij} = \mathcal{C}(w(s_i), w(s_j))$$`

---

# Matérn

- .hlb[Stationarity]. We say that the GF is second-order stationary if `\(\mu(s) = \mu\)` and `\(Cov(w(s), w(s + h)) = \mathcal{C}(h)\)` for all `\(h \in \mathbb{R}^r\)` such that `\(s\)` and `\(s + h\)` lie within `\(\mathcal{D}\)`. The covariance between two locations then depends only on the separation vector `\(h\)` between them, that is, on both distance and direction.

  - An example could be the spread of a pathogen in plants. If there is a road close to the crop, the pathogen could spread faster along the road, carried by cars or trucks, than across the crop, so the covariance would depend on the direction.

--

- .hlb[Isotropy]. We say that the GF is isotropic if the covariance function depends only on the Euclidean distance between points, i.e., `\(Cov(w(s), w(s + h)) = \mathcal{C}(||h||)\)`.

  - For instance, thinking again of the spread of a pathogen in a crop, isotropy would mean that the spread does not depend on the direction, only on the distance.

--

- The .hlb[Matérn covariance] function is a very common choice. With smoothness parameter fixed at 1, marginal variance `\(\sigma_{\boldsymbol{w}}^2\)` and range `\(\phi\)`, it is
`$$\mathcal{C}(||h||) = \sigma_{\boldsymbol{w}}^2 \left(\frac{\sqrt{8}}{\phi} ||h||\right) K_1 \left(\frac{\sqrt{8}}{\phi} ||h||\right)$$`
where `\(K_1\)` is the modified Bessel function of the second kind of order 1. Dividing by `\(\sigma_{\boldsymbol{w}}^2\)` gives the Matérn correlation function.

---

# Matérn correlation function

.center[ ]

<!-- # Spatial component -->
<!-- ## .hlb[Why is important to add the spatial component in our models?] -->
<!-- ### 1. Account for the .hlb[spatial autocorrelation] in our model -->
<!-- - .hlb[Better estimation] between the relationship of the response and explicative variables; -->
<!-- - .hlb[Better predictions]. -->
<!-- ### 2. When you add the spatial component as another variable in your model you can .hlb[map] it. -->
<!-- - The spatial effect indicate the .hlb[spatial intrinsic variability] of the data after the exclusion of the others explicative data, -->
<!-- - It could be a very useful tool as it could highlight different .hlb[spatial pattern] about a species distribution.
-->

---

# Geostatistics in the context of LGMs

<font size="+2"> .hlb[Likelihood] </font>

- A conditionally independent .hlb[Binomial likelihood] function is assumed:
`$$y_i \mid \pi_i \sim \text{Binomial}(n_i, \pi_i), \ \eta_i = \text{logit} (\pi_i)=\beta_0 + \beta_1 \text{Temp}_i + w_i \,\,, i=1, \ldots, 447 \,\,$$`
where `\(y_i\)` is the number of positive tests among the `\(n_i\)` individuals examined at location `\(i\)`.

--

<font size="+2"> .hlb[Latent Gaussian field] </font>

`$$\boldsymbol{w} \sim \mathcal{N}(0, \boldsymbol{\Sigma}(\sigma_{\boldsymbol{w}}, \phi)), \ \beta_j \sim \mathcal{N}(0, \tau = 0.001)$$`

`\(\boldsymbol{\theta}=(\beta_0, \beta_1, w_1, \ldots, w_{447})\)`, and `\(\boldsymbol{\theta} \mid \boldsymbol{\psi}\)` is Gaussian distributed. Here `\(\tau\)` denotes a precision.

- `\(\boldsymbol{w} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma(\sigma_{\boldsymbol{w}}, \phi)})\)`, i.e., the spatial effect is assumed to be a .hlb[continuous Gaussian field (GF)] with Matérn covariance structure, where:

  - `\(\Sigma(\sigma_{\boldsymbol{w}}, \phi)\)` is a .hlb[covariance matrix] depending on the distance between locations, `\(\sigma_{\boldsymbol{w}}\)` is the .hlb[standard deviation] of the spatial effect, and `\(\phi\)` is the .hlb[range] of the spatial effect.
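
To make the role of `\(\phi\)` concrete, a short worked example may help. It assumes the Matérn covariance with smoothness 1 given on the Matérn slide, and writes `\(d_{ij} = ||s_i - s_j||\)` for the distance between two sites (a symbol introduced here only for illustration). Under those assumptions, each entry of the covariance matrix would be

`$$\boldsymbol{\Sigma}_{ij} = \sigma_{\boldsymbol{w}}^2 \left(\frac{\sqrt{8}}{\phi} d_{ij}\right) K_1 \left(\frac{\sqrt{8}}{\phi} d_{ij}\right),$$`

so that at `\(d_{ij} = \phi\)` the correlation is `\(\sqrt{8}\, K_1(\sqrt{8}) \approx 0.14\)`. In this parameterization, `\(\phi\)` can therefore be read as the distance at which the spatial correlation has dropped to roughly 0.14.
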
--

<font size="+2"> .hlb[Hyperparameters] </font>

`\(\boldsymbol{\psi} = (\sigma_{\boldsymbol{w}}, \phi)\)`

---

</br>

## .center[.hlbred[Problem: INLA cannot fit continuous GFs]]

--

</br>

## .center[.hlb[Solution: approximate the continuous GFs using the Stochastic Partial Differential Equation (SPDE) approach]]

---

# The SPDE approach

.left-column2[
### .center[.hlb[Likelihood]]

`$$y_i \mid \pi_i \sim \text{Ber}(\pi_i) \,$$`

`$$\text{logit}(\pi_i) = \beta_0 + \beta_1 \text{Temp}_i + w_i \,$$`

### .center[.hlb[Latent Gaussian field]]

`$$\boldsymbol{\beta} \sim \mathcal{N}(\boldsymbol{0}, \tau = 0.0001)$$`

`$$\boldsymbol{w} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{\Sigma(\sigma_{\boldsymbol{w}}, \phi)})$$`

### .center[.hlb[Hyperparameters]]

`$$p(\sigma_{\boldsymbol{w}}, \phi)$$`
]

.right-column2[
### .center[.hlb[Likelihood]]

`$$y_i \mid \pi_i \sim \text{Ber}(\pi_i) \,$$`

`$$\text{logit}(\pi_i) = \beta_0 + \beta_1 \text{Temp}_i + w_i \,$$`

### .center[.hlb[Latent Gaussian field]]

`$$\boldsymbol{\beta} \sim \mathcal{N}(\boldsymbol{0}, \tau = 0.0001)$$`

`$$\boldsymbol{w} \sim \mathcal{N}(\boldsymbol{0}, \boldsymbol{Q^{-1}(\sigma_{\boldsymbol{w}}, \phi)})$$`

### .center[.hlb[Hyperparameters]]

`$$p(\sigma_{\boldsymbol{w}}, \phi)$$`
]

---

# How is the approximation conducted?

---

# Malaria Prevalence in Mozambique

.center[  ]

---
class: inverse, center, middle, animated, slideInLeft

# 4. Penalized complexity priors (PC-priors)

---

# Penalizing departure from the base model

- Simpson et al. (2017) propose priors that penalize departure from a base model, which is why they are called .hlb[Penalized Complexity (PC) priors].

--

- The prior favors the base model unless the data provide evidence against it, following the principle of parsimony.

--

- Distance from the base model is measured using a distance based on the .hlb[Kullback-Leibler] divergence, and departure from the base model is penalized at a .hlb[constant rate on the distance].
--

- Finally, the PC prior is defined using .hlb[probability statements] on the model parameters on an appropriate scale.

---

# Hyperpriors for the standard deviation of an iid effect

- The .hlb[PC-prior for the precision] `\(\tau\)` has density:
`$$p(\tau) = \frac{\lambda}{2} \tau^{-3/2} \exp(-\lambda \tau^{-1/2}) \,, \ \tau > 0 \,,$$`
where
`$$\lambda = - \frac{\ln(\alpha)}{u} \,,$$`
and `\((u, \alpha)\)` are the parameters of this prior. With `\(\sigma = \tau^{-1/2}\)`, the interpretation of `\((u, \alpha)\)` is that:
`$$Prob(\sigma > u) = \alpha \,, \ u>0 \,, \ 0< \alpha < 1 \,.$$`

--

- The functions `inla.pc.{d,p,q,r}prec` (density, distribution function, quantiles and random generation) allow us to .hlb[work with these priors].

--

- If we want to plot the prior in terms of the .hlb[standard deviation] `\(\sigma\)`, remember that the function `inla.tmarginal` lets us go from the `\(\tau\)` parameter to the `\(\sigma\)` parameter.

---

# Hyperpriors for the standard deviation of an iid effect. `\(\sigma = 1\)`

.center[ ]

---

# Spatial effect: priors

- The PC-prior for the .hlb[range] is defined in terms of `\(\phi_0\)` and `\(p_1\)` so that
`$$Prob(\phi < \phi_0) = p_1$$`

--

- The PC-prior for the .hlb[standard deviation] is defined in terms of `\(\sigma_0\)` and `\(p_2\)` so that
`$$Prob(\sigma_{\boldsymbol{w}} > \sigma_0) = p_2$$`

--

- To define the SPDE model using PC-priors, the following command has to be used:

``` r
spde <- inla.spde2.pcmatern(
  mesh = ...,
  prior.range = c(phi0, p1),    # Prob(range < phi0) = p1
  prior.sigma = c(sigma0, p2))  # Prob(sigma > sigma0) = p2
```

---
class: inverse, center, middle, animated, slideInRight

# 5. References

---

# This material has been constructed based on:

- Moraga, P., Dean, C., Inoue, J., Morawiecki, P., Noureen, S. R., & Wang, F. (2021). Bayesian spatial modelling of geostatistical data using INLA and SPDE methods: A case study predicting malaria risk in Mozambique. Spatial and Spatio-temporal Epidemiology, 39, 100440.

- Blangiardo, M., & Cameletti, M. (2015). Spatial and spatio-temporal Bayesian models with R-INLA. John Wiley & Sons.

- Fuglstad, G.
A., Simpson, D., Lindgren, F., & Rue, H. (2019). Constructing priors that penalize the complexity of Gaussian random fields. Journal of the American Statistical Association, 114(525), 445-452.

- <a href="https://www.r-inla.org/examples-tutorials" style="color:#FF0000;"> INLA tutorials </a>

- <a href="https://becarioprecario.bitbucket.io/inla-gitbook/index.html" style="color:#FF0000;"> INLA book by Virgilio Gómez-Rubio </a>

- <a href="https://www.paulamoraga.com/book-geospatial/" style="color:#FF0000;"> INLA book by Paula Moraga </a>

- <a href="https://becarioprecario.bitbucket.io/spde-gitbook/" style="color:#FF0000;"> SPDE book by Krainski et al. </a>

---
class: inverse, left, middle, animated, bounceInDown

</br>

# ADIM: Bayesian Spatial Statistics

## Master's Degree in Data Analysis, Process Improvement and Decision Support Engineering

</br>
<font size="6"> Joaquín Martínez-Minaya, 2024-12-16 </font>
</br>
<a href="http://vabar.es/" style="color:white;" > VAlencia BAyesian Research Group </a>
</br>
<a href="https://smeg-bayes.org/" style="color:white;"> Statistical Modeling Ecology Group </a>
</br>
<a href="https://giem.blogs.upv.es/" style="color:white;"> Grupo de Ingeniería Estadística Multivariante </a>
</br>
<a href="mailto:jmarmin@eio.upv.es" style="color:white;"> jmarmin@eio.upv.es </a>